Research Report MINING SEQUENTIAL PATTERNS: GENERALIZATIONS AND PERFORMANCE IMPROVEMENTS
نویسندگان
چکیده
The problem of mining sequential patterns was recently introduced in [AS95]. We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-speci ed minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequential pattern is \5% of customers bought `Foundation' and `Ringworld' in one transaction, followed by `Second Foundation' in a later transaction". We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transaction-times are within a user-speci ed time window. Third, given a user-de ned taxonomy (is-a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy. We present GSP, a new algorithm that discovers these generalized sequential patterns. Empirical evaluation using synthetic and real-life data indicates that GSP is much faster than the AprioriAll algorithm presented in [AS95]. GSP scales linearly with the number of data-sequences, and has very good scale-up properties with respect to the average datasequence size. Also, Department of Computer Science, University of Wisconsin, Madison.
منابع مشابه
Research Report Mining Sequential Patterns: Generalizations and Performance Improvements Limited Distribution Notice Mining Sequential Patterns: Generalizations and Performance Improvements
This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and speciic requests. After outside...
متن کاملMining Sequential Patterns: Generalizations and Performance Improvements
The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-speci ed minimum support, where the support of a pattern is the number of data-sequences that contain the p...
متن کاملChemical Compound Classification with Automatically Mined Structure Patterns
In this paper we propose new methods of chemical structure classification based on the integration of graph database mining from data mining and graph kernel functions from machine learning. In our method, we first identify a set of general graph patterns in chemical structure data. These patterns are then used to augment a graph kernel function that calculates the pairwise similarity between m...
متن کاملA Modified Response Surface Methodology for Knowledge Discovery with Simulations
The response surface methodology (RSM) provides an iterative process for learning that involves the sequential use of experimental design, empirical model building, and analysis of the developed models. This work provides a modified response surface methodology (MRSM) that can be applied to more complex simulation studies. These problems involve a larger number of input variables, multiple meas...
متن کاملGeneralization of Entropy Based Divergence Measures for Symbolic Sequence Analysis
Entropy based measures have been frequently used in symbolic sequence analysis. A symmetrized and smoothed form of Kullback-Leibler divergence or relative entropy, the Jensen-Shannon divergence (JSD), is of particular interest because of its sharing properties with families of other divergence measures and its interpretability in different domains including statistical physics, information theo...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1998